{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "heading_collapsed": true
   },
   "source": [
    "## Trouble installing fastai library?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "Here is a [guide to troubleshooting](https://docs.fast.ai/troubleshoot.html) problems with fastai installation.  By far, the most common problem is having fastai installed for a different environment/different Python installation than the one your Jupyter notebook is using (you can have Python installed in multiple places on your computer and not even realize it!). Or, you might have different versions of fastai installed in your different environments/different Python installations (and the one you are running in Jupyter notebook could be out of date, even if you installed version 1.0 somewhere else). For both of these problems, please [see this entry](https://docs.fast.ai/troubleshoot.html#modulenotfounderror-no-module-named-fastaivision)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "heading_collapsed": true
   },
   "source": [
    "## More detail about randomized SVD"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "I didn't cover how randomized SVD worked, because we aren't going to learn about it in detail in this course.  The main things I want you to know about randomized SVD are:\n",
    "- **it is fast**\n",
    "- **it gives us a truncated SVD** (whereas with traditional SVD, we are usually throwing away small singular values and their corresponding columns)\n",
    "\n",
    "If you were curious to know more, two keys are:\n",
    "- It is often useful to be able to reduce dimensionality of data in a way that preserves distances. The Johnson–Lindenstrauss lemma is a classic result of this type.  [Johnson-Lindenstrauss Lemma](https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma): a small set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved (proof uses random projections).\n",
    "- We haven't found a better general SVD method, we'll just use the method we have on a smaller matrix.  \n",
    "\n",
    "Below is an over-simplified version of `randomized_svd` (you wouldn't want to use this in practice, but it covers the core ideas).  The main part to notice is that we multiply our original matrix by a smaller random matrix (`M @ rand_matrix`) to produce `smaller_matrix`, and then use our same `np.linalg.svd` as before:\n",
    "\n",
    "```\n",
    "def randomized_svd(M, k=10):\n",
    "    m, n = M.shape\n",
    "    transpose = False\n",
    "    if m < n:\n",
    "        transpose = True\n",
    "        M = M.T\n",
    "        \n",
    "    rand_matrix = np.random.normal(size=(M.shape[1], k))  # short side by k\n",
    "    Q, _ = np.linalg.qr(M @ rand_matrix, mode='reduced')  # long side by k\n",
    "    smaller_matrix = Q.T @ M                              # k by short side\n",
    "    U_hat, s, V = np.linalg.svd(smaller_matrix, full_matrices=False)\n",
    "    U = Q @ U_hat\n",
    "    \n",
    "    if transpose:\n",
    "        return V.T, s.T, U.T\n",
    "    else:\n",
    "        return U, s, V\n",
    "```\n",
    "\n",
    "This code snippet is from this [randomized-SVD jupyter notebook](https://github.com/fastai/randomized-SVD/blob/master/Randomized%20SVD.ipynb) which was the demo I used for my PyBay talk on [Using randomness to make code much faster](https://www.youtube.com/watch?v=7i6kBz1kZ-A&list=PLtmWHNX-gukLQlMvtRJ19s7-8MrnRV6h6&index=7)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "heading_collapsed": true
   },
   "source": [
    "## Bayes Theorem"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "Ex: Physicist Leonard Mlodinow tested positive for HIV in 1989.  \n",
    "\tHis doctor said there was a 99.9% chance he had HIV.  \n",
    "   \n",
    "A = positive test results\t\t\n",
    "B = having HIV\n",
    "\n",
    "\n",
    "True positives: \t$P(A|B) = 99.9\\%$\t\n",
    "\n",
    "Prevalence:\t$P(B)= 0.01\\%$\n",
    "\n",
    "False positives:\t$P(A|B^C) = 0.1\\%$\t\n",
    "\n",
    "Was his doctor correct?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "This example is from the book:\n",
    "\n",
    "<img src=\"images/drunkards-walk.jpg\" alt=\"drunkard's walk\" style=\"width: 30%\"/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "Bayes Theorem (for conditional probabilities): $$ P(A | B) P(B) = P(B | A) P(A) $$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "### Answer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "hidden": true
   },
   "outputs": [
    {
     "ename": "SyntaxError",
     "evalue": "invalid syntax (<ipython-input-19-c59e5f13894a>, line 3)",
     "output_type": "error",
     "traceback": [
      "\u001b[0;36m  File \u001b[0;32m\"<ipython-input-19-c59e5f13894a>\"\u001b[0;36m, line \u001b[0;32m3\u001b[0m\n\u001b[0;31m    <img src=\"images/mlodinow-false-pos.png\" alt=\"Mlodinow\" style=\"width: 80%\"/>\u001b[0m\n\u001b[0m    ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m invalid syntax\n"
     ]
    }
   ],
   "source": [
    "# Exercise\n",
    "\n",
    "<img src=\"images/mlodinow-false-pos.png\" alt=\"Mlodinow\" style=\"width: 80%\"/>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "heading_collapsed": true
   },
   "source": [
    "## Derivation of Naive Bayes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "We want to calculate the probability that the review \"I loved it\" is positive.  Using Bayes Theorem, we can rewrite this:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "$$ P(\\text{pos} | \\text{\"I\"}, \\text{\"loved\"}, \\text{\"it\"}) = \\frac{P(\\text{\"I\"}, \\text{\"loved\"}, \\text{\"it\"}, | \\text{pos}) \\cdot P(\\text{\"loved\"} | \\text{pos}) \\cdot P(\\text{\"it\"} | \\text{pos}) \\cdot P(\\text{pos})}{P(\\text{\"I\"}, \\text{\"loved\"}, \\text{\"it})}$$\n",
    "\n",
    "The \"naive\" part of Naive Bayes is that we will assume that the probabilities of the different words are all independent.\n",
    "\n",
    "$$ P(\\text{pos} | \\text{\"I\"}, \\text{\"loved\"}, \\text{\"it\"}) = \\frac{P(\\text{\"I\"} | \\text{pos}) \\cdot P(\\text{\"loved\"} | \\text{pos}) \\cdot P(\\text{\"it\"} | \\text{pos}) \\cdot P(\\text{pos})}{P(\\text{\"I\"}, \\text{\"loved\"}, \\text{\"it})}$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "We do the same calculation to see how likely it is the review is negative, and then choose whichever is larger."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "$$ P(\\text{neg} | \\text{\"I\"}, \\text{\"loved\"}, \\text{\"it\"}) = \\frac{P(\\text{\"I\"} | \\text{neg}) \\cdot P(\\text{\"loved\"} | \\text{neg}) \\cdot P(\\text{\"it\"} | \\text{neg}) \\cdot P(\\text{neg})}{P(\\text{\"I\"}, \\text{\"loved\"}, \\text{\"it})}$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "We will add one to avoid dividing by zero (or something close to it).  Similarly, we take logarithms to avoid multiplying by a lot of tiny values.  For the reasons we want to avoid this, please see the next section on numerical stability:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "More reading: [Using log-probabilities for Naive Bayes](http://www.cs.rhodes.edu/~kirlinp/courses/ai/f18/projects/proj3/naive-bayes-log-probs.pdf)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "heading_collapsed": true
   },
   "source": [
    "## Numerical Stability"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "#### Exercise"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "Take a moment to look at the function $f$ below.  Before you try running it, write on paper what the output would be of $x_1 = f(\\frac{1}{10})$.  Now, (still on paper) plug that back into $f$ and calculate $x_2 = f(x_1)$.  Keep going for 10 iterations.\n",
    "\n",
    "This example is taken from page 107 of *Numerical Methods*, by Greenbaum and Chartier."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "def f(x):\n",
    "    if x <= 1/2:\n",
    "        return 2 * x\n",
    "    if x > 1/2:\n",
    "        return 2*x - 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "Only after you've written down what you think the answer should be, run the code below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "hidden": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0.1\n",
      "0.2\n",
      "0.4\n",
      "0.8\n",
      "0.6000000000000001\n",
      "0.20000000000000018\n",
      "0.40000000000000036\n",
      "0.8000000000000007\n",
      "0.6000000000000014\n",
      "0.20000000000000284\n",
      "0.4000000000000057\n",
      "0.8000000000000114\n",
      "0.6000000000000227\n",
      "0.20000000000004547\n",
      "0.40000000000009095\n",
      "0.8000000000001819\n",
      "0.6000000000003638\n",
      "0.2000000000007276\n",
      "0.4000000000014552\n",
      "0.8000000000029104\n",
      "0.6000000000058208\n",
      "0.20000000001164153\n",
      "0.40000000002328306\n",
      "0.8000000000465661\n",
      "0.6000000000931323\n",
      "0.20000000018626451\n",
      "0.40000000037252903\n",
      "0.8000000007450581\n",
      "0.6000000014901161\n",
      "0.20000000298023224\n",
      "0.4000000059604645\n",
      "0.800000011920929\n",
      "0.6000000238418579\n",
      "0.20000004768371582\n",
      "0.40000009536743164\n",
      "0.8000001907348633\n",
      "0.6000003814697266\n",
      "0.20000076293945312\n",
      "0.40000152587890625\n",
      "0.8000030517578125\n",
      "0.600006103515625\n",
      "0.20001220703125\n",
      "0.4000244140625\n",
      "0.800048828125\n",
      "0.60009765625\n",
      "0.2001953125\n",
      "0.400390625\n",
      "0.80078125\n",
      "0.6015625\n",
      "0.203125\n",
      "0.40625\n",
      "0.8125\n",
      "0.625\n",
      "0.25\n",
      "0.5\n",
      "1.0\n",
      "1.0\n",
      "1.0\n",
      "1.0\n",
      "1.0\n",
      "1.0\n",
      "1.0\n",
      "1.0\n",
      "1.0\n",
      "1.0\n",
      "1.0\n",
      "1.0\n",
      "1.0\n",
      "1.0\n",
      "1.0\n",
      "1.0\n",
      "1.0\n",
      "1.0\n",
      "1.0\n",
      "1.0\n",
      "1.0\n",
      "1.0\n",
      "1.0\n",
      "1.0\n",
      "1.0\n"
     ]
    }
   ],
   "source": [
    "x = 1/10\n",
    "for i in range(80):\n",
    "    print(x)\n",
    "    x = f(x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "What went wrong?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "### Problem: math is continuous & infinite, but computers are discrete & finite"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "Two Limitations of computer representations of numbers:\n",
    "1. they can't be arbitrarily large or small\n",
    "2. there must be gaps between them\n",
    "\n",
    "The reason we need to care about accuracy, is because computers can't store infinitely accurate numbers.  It's possible to create calculations that give very wrong answers (particularly when repeating an operation many times, since each operation could multiply the error)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "How computers store numbers:\n",
    "\n",
    "<img src=\"images/fpa.png\" alt=\"floating point\" style=\"width: 60%\"/>\n",
    "\n",
    "The *mantissa* can also be referred to as the *significand*."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "IEEE Double precision arithmetic:\n",
    "- Numbers can be as large as $1.79 \\times 10^{308}$ and as small as $2.23 \\times 10^{-308}$.\n",
    "- The interval $[1,2]$ is represented by discrete subset: \n",
    "$$1, \\: 1+2^{-52}, \\: 1+2 \\times 2^{-52},\\: 1+3 \\times 2^{-52},\\: \\ldots, 2$$\n",
    "\n",
    "- The interval $[2,4]$ is represented:\n",
    "$$2, \\: 2+2^{-51}, \\: 2+2 \\times 2^{-51},\\: 2+3 \\times 2^{-51},\\: \\ldots, 4$$\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "Floats and doubles are not equidistant:\n",
    "\n",
    "<img src=\"images/fltscale-wh.png\" alt=\"floating point\" style=\"width: 100%\"/>\n",
    "Source: [What you never wanted to know about floating point but will be forced to find out](http://www.volkerschatz.com/science/float.html)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "**Machine Epsilon**\n",
    "\n",
    "Half the distance between 1 and the next larger number. This can vary by computer.  IEEE standards for double precision specify $$ \\varepsilon_{machine} = 2^{-53} \\approx 1.11 \\times 10^{-16}$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "**Two important properties of Floating Point Arithmetic**:\n",
    "\n",
    "- The difference between a real number $x$ and its closest floating point approximation $fl(x)$ is always smaller than $\\varepsilon_{machine}$ in relative terms.  For some $\\varepsilon$, where $\\lvert \\varepsilon \\rvert \\leq \\varepsilon_{machine}$, $$fl(x)=x \\cdot (1 + \\varepsilon)$$\n",
    "\n",
    "- Where * is any operation ($+, -, \\times, \\div$), and $\\circledast$ is its floating point analogue,\n",
    "    $$ x \\circledast y = (x * y)(1 + \\varepsilon)$$\n",
    "for some $\\varepsilon$, where $\\lvert \\varepsilon \\rvert \\leq \\varepsilon_{machine}$\n",
    "That is, every operation of floating point arithmetic is exact up to a relative error of size at most $\\varepsilon_{machine}$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "heading_collapsed": true
   },
   "source": [
    "## Speed of different types of memory"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "This course is 90% NLP and 10% things I want to make sure you see before the end of your MSDS."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "Here are some *numbers everyone should know* (from the legendary [Jeff Dean](http://static.googleusercontent.com/media/research.google.com/en/us/people/jeff/stanford-295-talk.pdf)):\n",
    "- L1 cache reference 0.5 ns\n",
    "- L2 cache reference 7 ns\n",
    "- Main memory reference/RAM 100 ns\n",
    "- Send 2K bytes over 1 Gbps network 20,000 ns\n",
    "- Read 1 MB sequentially from memory 250,000 ns\n",
    "- Round trip within same datacenter 500,000 ns\n",
    "- Disk seek 10,000,000 ns\n",
    "- Read 1 MB sequentially from network 10,000,000 ns\n",
    "- Read 1 MB sequentially from disk 30,000,000 ns\n",
    "- Send packet CA->Netherlands->CA 150,000,000 ns\n",
    "\n",
    "And here is an updated, interactive [version](https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html), which includes a timeline of how these numbers have changed.\n",
    "\n",
    "**Key take-away**: Each successive memory type is (at least) an order of magnitude worse than the one before it.  Disk seeks are **very slow**."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "heading_collapsed": true
   },
   "source": [
    "## Revisiting Naive Bayes in an Excel Spreadsheet"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "Let's calculate naive bayes in a spreadsheet to get a more visual picture of what is going on.  Here's how I processed the data for this:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "### Loading our data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "from fastai import *\n",
    "from fastai.text import *"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "path = untar_data(URLs.IMDB_SAMPLE)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "movie_reviews = (TextList.from_csv(path, 'texts.csv', cols='text')\n",
    "                         .split_from_df(col=2)\n",
    "                         .label_from_df(cols=0))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "def get_term_doc_matrix(label_list, vocab_len):\n",
    "    j_indices = []\n",
    "    indptr = []\n",
    "    values = []\n",
    "    indptr.append(0)\n",
    "\n",
    "    for i, doc in enumerate(label_list):\n",
    "        feature_counter = Counter(doc.data)\n",
    "        j_indices.extend(feature_counter.keys())\n",
    "        values.extend(feature_counter.values())\n",
    "        indptr.append(len(j_indices))\n",
    "        \n",
    "#     return (values, j_indices, indptr)\n",
    "\n",
    "    return scipy.sparse.csr_matrix((values, j_indices, indptr),\n",
    "                                   shape=(len(indptr) - 1, vocab_len),\n",
    "                                   dtype=int)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "hidden": true,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "trn_term_doc = get_term_doc_matrix(movie_reviews.train.x, len(movie_reviews.vocab.itos))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "### Getting data for our spreadsheet"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "To keep our spreadsheet manageable, we will just get the 40 shortest reviews:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "hidden": true,
    "scrolled": false
   },
   "outputs": [],
   "source": [
    "inds = np.argpartition(np.count_nonzero(trn_term_doc.todense(), 1), 40, axis=0)[:40]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "inds = np.squeeze(np.asarray(inds))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "Let's get the text from these 40 shortest reviews:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "list_text = [movie_reviews.train.x[i].text for i in inds]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "Get counts for all vocab used in our selection of the 40 shortest reviews:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "hidden": true,
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "vocab_used = defaultdict(int)\n",
    "\n",
    "for i in inds:\n",
    "     for val in movie_reviews.train.x[i].data:\n",
    "        vocab_used[val] += 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "Let's choose the words that are used at least 6 times (so not too rare), but less than 30 (so not too common).  You could try experimenting with different cut-off points on your own:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "interesting_inds = [key for key, val in vocab_used.items() if val < 30 and val >6]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {
    "hidden": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "44"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(interesting_inds)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "I copied the vocab and text of the movie reviews directly from here to paste into the spreadsheet:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "hidden": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['so',\n",
       " 'bad',\n",
       " 'very',\n",
       " 'acting',\n",
       " '\\n \\n ',\n",
       " 'do',\n",
       " 'xxup',\n",
       " 'film',\n",
       " 'from',\n",
       " 'some',\n",
       " 'if',\n",
       " 'are',\n",
       " 'for',\n",
       " 'up',\n",
       " 'one',\n",
       " 'have',\n",
       " 'all',\n",
       " 'an',\n",
       " 'no',\n",
       " 'just',\n",
       " 'like',\n",
       " 'good',\n",
       " 'great',\n",
       " 'but',\n",
       " '...',\n",
       " 'about',\n",
       " 'movies',\n",
       " 'seen',\n",
       " 'with',\n",
       " '!',\n",
       " 'me',\n",
       " 'as',\n",
       " \"'s\",\n",
       " 'was',\n",
       " 'that',\n",
       " 'out',\n",
       " '\"',\n",
       " 'on',\n",
       " \"n't\",\n",
       " 'story',\n",
       " '-',\n",
       " '(',\n",
       " ')',\n",
       " 'not']"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "[movie_reviews.vocab.itos[i] for i in interesting_inds]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 541,
   "metadata": {
    "hidden": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[\"xxbos xxmaj this movie is so bad , i knew how it ends right after this little girl killed the first person . xxmaj very bad acting very bad plot very bad movie \\n \\n  do yourself a xxunk and xxup don't watch it 1 / 10\",\n",
       " 'xxbos i found this film to funny from the start . xxmaj john xxmaj waters use of characters reminded of some of the down to earth characters from xxmaj xxunk films . xxmaj christina xxmaj xxunk has once again xxunk her abilities in this film . xxmaj if you are looking for a fun movie without xxunk , i recommend this film .',\n",
       " 'xxbos xxmaj if you ever see a stand up comedy movie this is the one . xxmaj you will laugh xxunk if you have any sense of humor at all . xxmaj this is a once in a lifetime performance from a once in a lifetime performer . xxmaj this is a stand up standard .',\n",
       " 'xxbos xxmaj another movie to suffer without an adventure to run , no xxunk to solve . xxmaj just an xxunk man , acting like an animal . xxmaj no a good reason to take this journey . xxmaj pitt and xxmaj lewis are great actors ; magnificent xxmaj michelle xxmaj xxunk but a weak xxmaj david xxmaj duchovny performance ...',\n",
       " 'xxbos xxmaj this is an hilarious movie . xxmaj one of the very best things about it is the quality of the performance by each actor . xxmaj from the xxunk role to the xxunk , each character is vivid , xxunk and so understandable . xxmaj it can also make you laugh so hard your health will improve .',\n",
       " 'xxbos xxmaj this is by far one of the worst movies i have ever seen , the poor special effects along with the poor acting are just a few of the things wrong with this film . i am fan of the first two major xxunk but this one is lame !',\n",
       " \"xxbos xxmaj the direction struck me as poor man 's xxmaj xxunk xxmaj bergman . xxmaj the xxunk dialogue was annoying . xxmaj the xxunk xxunk that all characters except xxmaj xxunk ' showed made me think they were drugged . i think the director ruined it for me .\",\n",
       " 'xxbos xxmaj omar xxmaj xxunk is an outstanding actor . i really think he gets into his character xxunk . xxmaj when xxunk gets shot he shows true emotion . xxmaj he also shows true emotion when xxunk puts the gun to him in the room . xxmaj omar is a very talented actor ! !',\n",
       " 'xxbos xxmaj the greatest xxunk in life is being dull , and this movie is xxunk boring . its funny , its left out of his \" a life in film \" documentary . xxmaj he goes from a long piece on \" xxmaj xxunk xxmaj memories \" and then fast forwards to \" xxmaj xxunk \" . xxmaj this little piece of xxunk xxunk just is n\\'t worth the effort .',\n",
       " \"xxbos xxmaj the way the story is developed , keeps the audience wondering what is the xxunk 's dark past . xxmaj we get some clues during the series , but enough to keep us interested in the mini - series . xxmaj the characters are all believable and i personally felt xxunk and xxunk by the story .\",\n",
       " 'xxbos xxmaj this movie is one of the masterpieces from xxmaj mr. xxmaj xxunk . xxmaj it is about youth , xxunk , happiness , xxunk , xxunk , honor , corruption . xxmaj and it is like everything else from great xxmaj italian director xxunk art . \\n \\n ',\n",
       " \"xxbos xxmaj it 's terrific when a funny movie does n't make smile you . xxmaj what a pity ! ! xxmaj this film is very boring and so long . xxmaj it 's simply xxunk . xxmaj the story is staggering without goal and no fun . \\n \\n  xxmaj you feel better when it 's finished .\",\n",
       " 'xxbos \" xxmaj foxes \" is a great film . xxmaj the four young actresses xxmaj xxunk xxmaj foster , xxmaj xxunk xxmaj xxunk , xxmaj marilyn xxmaj xxunk and xxmaj xxunk xxmaj xxunk are wonderful . xxmaj the song \" xxmaj on the radio \" by xxmaj donna xxmaj summer is lovely . a great film . xxrep 5 *',\n",
       " 'xxbos xxmaj this movie features xxmaj charlie xxmaj xxunk dancing in a strip club . xxmaj beyond that , it features a truly bad script with dull , unrealistic dialogue . xxmaj that it got as many positive xxunk suggests some people may be joking .',\n",
       " 'xxbos xxmaj this film is brilliant ! xxmaj it touches everyone who sees it in an extraordinary way . xxmaj it really takes you back to your youth and puts a new perspective on how you view your childhood memories . xxmaj there are so many layers to this film . xxmaj it is innovative and absolutely fabulous !',\n",
       " \"xxbos i understand there was some conflict between xxmaj xxunk and the great xxmaj xxunk xxmaj smith during the filming . xxmaj understandable when you put one of the world 's greatest actresses of all time ( xxmaj smith , of course ) with one whose performances seem to get worse with each subsequent film .\",\n",
       " \"xxbos ... okay , maybe not all of it . xxmaj xxunk by the false promise of bikini - xxunk women on the movie 's cover ... but the xxup horror ... xxup the xxup horror ... ... whatever you do , do xxup not watch this movie . xxmaj xxunk out your eyes , repeatedly xxunk your xxunk in ... do what it takes . xxmaj never again -- never forget ! \\n \\n \",\n",
       " 'xxbos a bit of xxmaj trivia b / c i ca n\\'t figure out how to submit xxmaj trivia : xxmaj in the backdrop of this performance , one of the images is \\n \\n  xxmaj george xxmaj xxunk \\'s \" a xxmaj sunday xxmaj afternoon on the xxmaj island of xxmaj la xxmaj grande xxmaj xxunk \" painting ( seen best in chapter 18 ) , this painting is the subject of a xxmaj xxunk musical xxmaj sunday in the xxmaj park with xxmaj george . \\n \\n  a bit of xxmaj trivia b / c i ca n\\'t figure out how to submit xxmaj trivia : xxmaj in the backdrop of this performance , one of the images is \\n \\n  xxmaj george xxmaj xxunk \\'s \" a xxmaj sunday xxmaj afternoon on the xxmaj island of xxmaj la xxmaj grande xxmaj xxunk \" painting ( seen best in chapter 18 ) , this painting is the subject of a xxmaj xxunk musical xxmaj sunday in the xxmaj park with xxmaj george .',\n",
       " \"xxbos xxmaj it 's a good movie if you plan to watch lots of landscapes and animals , like an animal documentary . xxmaj and making xxmaj pierce xxmaj xxunk an indian make you wonder ' xxmaj does all those people do n't recognize if someone is n't indian at plain sight ? '\",\n",
       " \"xxbos xxmaj very nice action with an xxunk story which actually does n't suck . xxmaj interesting enough to merit watching instead of skipping past to get to the good parts . xxmaj having xxmaj jenna xxmaj xxunk and xxmaj asia xxmaj xxunk helps xxunk it up , too . xxmaj jenna in that xxunk and those xxunk is just xxunk ! xxmaj worth picking up just to see her !\",\n",
       " \"xxbos a very realistic portrait of a broken family and the effect it has on the kid caught in between . xxmaj as a child of divorced parents i was totally relating to events in the film . xxmaj also - a really cool zombie twist which i thought was xxup very xxup original . i 'm tired of the same old stuff in movies . a very realistic portrait of a broken family and the effect it has on the kid caught in between . xxmaj as a child of divorced parents i was totally relating to events in the film . xxmaj also - a really cool zombie twist which i thought was xxup very xxup original . i 'm tired of the same old stuff in movies . a very realistic portrait of a broken family and the effect it has on the kid caught in between . xxmaj as a child of divorced parents i was totally relating to events in the film . xxmaj also - a really cool zombie twist which i thought was xxup very xxup original . i 'm tired of the same old stuff in movies .\",\n",
       " \"xxbos xxmaj the wit and pace and three show stopping xxmaj xxunk xxmaj xxunk numbers put this ahead of the over - rated xxunk xxmaj street . xxmaj this is the xxunk 30 's musical with a knockout xxunk performance from xxmaj jimmy xxmaj cagney . xxmaj one of the last xxunk before the xxmaj motion xxmaj picture xxmaj production xxmaj code was strictly xxunk . a must see .\",\n",
       " 'xxbos xxmaj this film is scary because you can find yourself relating to ideas they have and can recall other people saying and having xxunk ideas make this a haunting well done movie xxrep 4 . the xxunk style is not xxunk to point it xxunk you out of film like blair witch it only adds to the raw \" real \" feeling of the film that makes it .',\n",
       " \"xxbos i did enjoy this film , i thought it ended up being an old fashioned love story with a few twists . i expected him to get the girl , i wo n't tell you if he does or not you will need to watch the movie to find out . xxmaj overall if you are looking to watch a love story this one will suffice .\",\n",
       " 'xxbos xxmaj this movie was so bad that my xxunk . went down about 40 points after seeing it . xxmaj it made me wonder who could sit through the weeks it took to make it and think that it was worth it . xxmaj it must of been some kind of personal favor to xxmaj van xxmaj damme .',\n",
       " 'xxbos i can not believe how unknown this movie is , it was absolutely incredible . xxmaj the ending alone has stuck with me for almost thirty years . xxmaj the road sign through the xxunk mirror blew me away . xxmaj if you liked \" xxup race xxup with xxup the xxup devil \" you will love this movie',\n",
       " \"xxbos xxmaj this was a horrible film ! i gave it 2 xxmaj points , one for xxmaj xxunk xxmaj jolie and a second one for the beautiful xxmaj xxunk in the beginning ... xxmaj other than that the story just plain sucked and cars xxunk through cities was n't so new in 1970 . xxmaj the xxmaj xxunk was probably what annoyed me the most , xxunk seen anything so constructed !\",\n",
       " \"xxbos xxmaj good actors and good performances ca n't mask a pointless script , bad dialogue , and patterns of behavior xxunk into nothing you 'd care about . xxmaj the most interesting character is xxmaj david xxmaj xxunk . xxmaj no character development - no xxunk , no interest , just some suffering for no particular reason , teaching us nothing and not even xxunk to entertain .\",\n",
       " 'xxbos xxmaj yes , i \\'m sentimental & xxunk ! ! xxmaj but this movie ( and it \\'s theme song ) remain one of my all time xxunk ! ! xxmaj robert xxmaj xxunk xxmaj jr. does such justice to the role of \" xxmaj louis xxmaj xxunk \" xxunk and the storyline ( although far - xxunk ) is romantic & makes one believe in happy xxunk ! !',\n",
       " \"xxbos i thoroughly enjoyed this film for its humor and xxunk . i especially like the way the characters welcomed xxmaj gina 's various suitors . xxmaj with friends ( and family ) like these anyone would feel xxunk and loved . i found the writing witty and natural and the actors made the material come alive .\",\n",
       " \"xxbos i love this movie . i mean the story may not be the best , but the dancing most certainly makes up for it . xxmaj you get to know a little bit about each character the way xxup they want you to learn about them . i just think that you wo n't like this movie unless you 're into xxmaj broadway ...\",\n",
       " \"xxbos i and a friend rented this movie . xxmaj we both found the movie soundtrack and production techniques to be xxunk . xxmaj the movie 's plot appeared to drag on throughout with little surprise in the ending . xxmaj we both xxunk that the movie could have been xxunk into roughly an hour giving it more suspense and moving plot .\",\n",
       " 'xxbos xxmaj this movie was amazingly bad . i do n\\'t think i \\'ve ever seen a movie where every attempt at humor failed as miserably . xxmaj let \\'s see ... the acting was pathetic , the \" special effects \" where horrible , the plot non - xxunk ... that pretty much xxunk up this movie . \\n \\n ',\n",
       " 'xxbos xxmaj this is a worthless sequel to a great action movie . xxmaj cheap looking , and worst of all , xxup boring xxup action xxup scenes ! xxmaj the only decent thing about the movie is the last fight sequence . xxmaj only xxunk minutes , but it feels like it goes on forever ! xxmaj even die - hard xxmaj van xxmaj damme xxunk myself ) should avoid this one !',\n",
       " 'xxbos xxmaj beautiful and touching movie . xxmaj rich colors , great settings , good acting and one of the most charming movies i have seen in a while . i never saw such an interesting setting when i was in xxmaj china . xxmaj my wife liked it so much she asked me to xxunk on and rate it so other would enjoy too .',\n",
       " \"xxbos xxmaj this film is the worst film i have ever seen . xxmaj the story line is weak - i could n't even follow it . xxmaj the acting is high - xxunk . xxmaj the sound track is irritating . xxmaj the attempts at humor are not . xxmaj the editing is horrible . xxmaj the credits are even slow - i would be embarrassed to have my name associated with this waste of film . xxmaj do n't waste your time even thinking about this attempt at acting .\",\n",
       " 'xxbos i enjoyed this movie . xxmaj unlike like some of the xxunk up , xxunk trash that is passed off as action movies , xxmaj playing xxmaj god is simple and realistic , with characters that are believable , action that is not over the top and enough twists and turns to keep you interested until the end . \\n \\n  xxmaj well directed , well acted and a good story .',\n",
       " 'xxbos i caught this movie about 8 years ago , and have never had it of my mind . surely someone out there will release it on xxmaj video , or hey why not xxup dvd ! xxmaj the ford xxunk is the star xxrep 7 . if you have any head for cars xxup watch xxup this and be blown away .',\n",
       " \"xxbos all i have to say is if you do n't like it then there is something wrong with you . plus xxmaj jessica is just all kinds of hot xxrep 5 ! the only reason you may not like it is because it is set in the future where xxmaj xxunk has gone to hell . that and you my not like it cause the future they show could very well happen .\",\n",
       " \"xxbos xxmaj this tale set in xxmaj xxunk , xxmaj new xxmaj zealand xxunk ( xxmaj xxunk xxunk of the xxunk xxmaj xxunk xxmaj college ) is xxunk 's first feature . \\n \\n  xxmaj with a contemporary xxmaj new xxmaj zealand xxunk xxmaj via xxmaj xxunk xxunk with absolutely hilarious situations which develop in the ( adult ) family context . xxmaj at the same time it manages to xxunk intense emotions of sadness and despair . \\n \\n  xxmaj one of the most moving and xxunk movies of the year - not to be missed !\"]"
      ]
     },
     "execution_count": 541,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "list_text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 605,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "x = trn_term_doc[inds,:]\n",
    "y = movie_reviews.train.y[inds]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "#### Export to CSVs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "hidden": true
   },
   "source": [
    "Let's export the term-document matrix and the labels to CSVs.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 493,
   "metadata": {
    "hidden": true
   },
   "outputs": [],
   "source": [
    "from IPython.display import FileLink, FileLinks"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 607,
   "metadata": {
    "hidden": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<a href='x.csv' target='_blank'>x.csv</a><br>"
      ],
      "text/plain": [
       "/home/racheltho/git/nlp/x.csv"
      ]
     },
     "execution_count": 607,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.savetxt(\"x.csv\", x.todense()[:,interesting_inds], delimiter=\",\", fmt='%.14f')\n",
    "FileLink('x.csv')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 584,
   "metadata": {
    "hidden": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<a href='y.csv' target='_blank'>y.csv</a><br>"
      ],
      "text/plain": [
       "/home/racheltho/git/nlp/y.csv"
      ]
     },
     "execution_count": 584,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "np.savetxt(\"y.csv\", y, delimiter=\",\", fmt=\"%i\")\n",
    "FileLink('y.csv')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
